# README

## Data access

The central data in this project—household-level natural gas bills from PG&E and SoCalGas—are covered by a non-disclosure agreement and cannot be released.

Although we cannot share the data, we describe below both the process for requesting them and the scripts we have uploaded for cleaning and analyzing them.

PG&E (one of the two utilities in our study) also describes the process in this link: https://pge-energydatarequest.com/sites/default/files/Acamdemic_research_requirements.pdf

1. Receive IRB/CPHS approval from covered institutions (or request exemption if you are not planning to receive personally identifying information).
2. Contact the utilities through their data-request programs (see links below that discuss utilities' data request programs).
3. Negotiate the terms of the NDA, including the data covered and the security protocols.

Links for data requests

- CPUC rule on data requests: https://docs.cpuc.ca.gov/PublishedDocs/Published/G000/M090/K845/90845985.PDF
- PG&E data request program: https://pge-energydatarequest.com/
- SoCalGas data request program: https://www.socalgas.com/for-your-business/energy-savings/energy-usage-requests
- SDG&E data request program: https://energydata.sdge.com/showDataAccessAndRelease

## Description of scripts

Below we describe the function of each of the scripts that make up our project.

### Analysis scripts

*Note:* Most analysis files require `buildPanelAllZips.R`.

`analyzeAllZips.R`

- Runs the analysis on a sample of *all* PG&E and SoCalGas zip codes (i.e., not just the border discontinuity).

`analyzeAllZipsOLS.R`

- Samples from households with at least 12 bills.
- Keeps bills covering between 28 and 34 days.
- Does not restrict the sample to 2010–2014 (PG&E data cover 2002–2014; SoCalGas data cover 2010–2015).
- Estimates elasticities via OLS by zip.
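
The filtering and estimation steps above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the contents of `analyzeAllZipsOLS.R`: the column names (`hh_id`, `zip`, `bill_days`, `therms`, `price`), the order of the two filters, and the log-log specification are all assumptions.

```r
# Hypothetical toy billing data; all column names are assumptions.
set.seed(1)
n_hh <- 20
bills <- data.frame(
  hh_id     = rep(seq_len(n_hh), each = 14),
  zip       = rep(c("93001", "93002"), each = 14 * n_hh / 2),
  bill_days = sample(26:36, 14 * n_hh, replace = TRUE),
  price     = runif(14 * n_hh, 0.8, 1.6)
)
bills$therms <- exp(3 - 0.2 * log(bills$price) + rnorm(nrow(bills), sd = 0.1))

# Keep households with at least 12 bills.
n_bills <- table(bills$hh_id)
keep_hh <- names(n_bills)[n_bills >= 12]
bills <- bills[bills$hh_id %in% keep_hh, ]

# Keep bills covering 28 to 34 days.
bills <- bills[bills$bill_days >= 28 & bills$bill_days <= 34, ]

# Log-log OLS within each zip; the slope on log(price) is the elasticity.
elasticity_by_zip <- sapply(split(bills, bills$zip), function(d) {
  coef(lm(log(therms) ~ log(price), data = d))[["log(price)"]]
})
```

`elasticity_by_zip` is a named numeric vector with one estimate per zip code.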

`analyzeBalance.R`

- Checks 'balance' across utilities in main sample.

`analyzeCareIncome.R`

- Analyzes (via regression and plot) the relationship between zip-level income and the share of households enrolled in CARE.
- Zip-level income data come from the California Franchise Tax Board.
- The CARE share comes from our billing data.

`analyzeCommon.R`

- Main elasticity analysis of zip codes common to both utilities (with and without controlling for HDDs).
- Also: Analysis of zip codes neighboring the zip codes that are common to both utilities.
- Analysis considers three dimensions:
  1. Five types of price (marginal, average, average marginal, baseline, simulated)
  2. Prices' leads/lags (e.g., contemporaneous vs. lagged)
  3. Level of spatiotemporal fixed effects

`analyzeCommonHeterogeneity.R`

- Estimates heterogeneity by income, season, and their interaction.
- Considers the same dimensions as `analyzeCommon.R` (e.g., types of price, leads/lags, FEs, HDDs).

`analyzeCommonRobustness.R`

- Runs additional robustness checks on the results in `analyzeCommon.R` (controls, FEs, lags).
- Also tests the simulated instrument.

`analyzeInstrumentHDD.R`

- Runs an alternative instrument strategy in which the instrument is population-weighted HDDs in the eastern US (conditional on western HDDs).
- Included for the appendix in response to a referee request.

`analyzeVariationSources.R`

- Runs a sequence of analyses to determine which sources of variation drive our estimates.
- Starts with 'standard' (main) specification.
- Test 1: Shut down the border discontinuity
- Test 2: Shut down the within-month-utility price variation by interacting day-of-month with city-YM FEs
- Test 3: Shut down the within-month-utility price variation using the first (or median) price observed for each utility-city-YM

`testJointHypotheses.R`

- Runs joint hypothesis tests on the heterogeneity results.

### Build scripts

`buildAnalysisDatasets.R`

- Build main analysis dataset of cities/zip codes common to both utilities.

`buildBaselinesPGE.R` and `buildBaselinesSoCal.R`

- Clean utilities' baseline quantity files.
- Raw files downloaded from utilities' websites.

`buildDegreeDaysNARR.R`

- Calculate heating degree days (HDD) and cooling degree days (CDD) by day and county (for CONUS).
- Data:
  1. 2010 Census shapefile of county outlines
  2. Average daily temperature data (NetCDF files) from NOAA's NARR
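
For readers unfamiliar with degree days, the calculation can be sketched as below. This is only an illustration, not the code in `buildDegreeDaysNARR.R`: the 65°F base temperature is a common convention that we assume here, and the function name, column names, and toy values are hypothetical.

```r
# Degree days from average daily temperature (degrees Fahrenheit).
# The 65-degree base is an assumption (a common convention).
degree_days <- function(temp_f, base = 65) {
  data.frame(
    hdd = pmax(base - temp_f, 0),  # heating: degrees below the base
    cdd = pmax(temp_f - base, 0)   # cooling: degrees above the base
  )
}

# Hypothetical county-day average temperatures.
temps <- data.frame(
  county = c("06001", "06001", "06037"),
  date   = as.Date(c("2012-01-15", "2012-07-15", "2012-01-15")),
  temp_f = c(48, 72, 65)
)
temps <- cbind(temps, degree_days(temps$temp_f))
```

A day at exactly the base temperature contributes zero to both HDD and CDD.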

`buildHenryHub.R`

- Build weekly time-series of Henry Hub Spot Price.
- Part of the main instrument in the paper.

`buildPanelAllZips.R`

- Sample a percentage of households from each PGE and SoCal zip code.
- We eventually use this sample to estimate 'all zips' elasticities.
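
A within-zip sampling step of this kind might look like the sketch below. The 10% share, the seed, and the column names are assumptions for illustration; the actual sampling rate and implementation live in `buildPanelAllZips.R`.

```r
set.seed(42)

# Hypothetical household roster; column names are assumptions.
households <- data.frame(
  hh_id = 1:300,
  zip   = rep(c("94001", "94002", "94003"), times = c(50, 100, 150))
)

# Draw the same share of households within each zip code.
sample_share <- 0.10
ids_by_zip <- split(households$hh_id, households$zip)
sampled_ids <- unlist(lapply(ids_by_zip, function(ids) {
  n_draw <- max(1, round(length(ids) * sample_share))
  # Index with sample.int() so single-household zips are handled correctly.
  ids[sample.int(length(ids), n_draw)]
}), use.names = FALSE)

panel_sample <- households[households$hh_id %in% sampled_ids, ]
```

Sampling within each zip (rather than from the pooled roster) keeps every zip code represented in proportion to its size.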

`buildPanelCommonZips.R`

- This script aggregates the individual zip-by-utility billing datasets.
- INPUT: zip-utility (.rds) billing data created by `buildPanelPGE.R` and `buildPanelSoCal.R`. These files have consumption and price information.
- OUTPUT: a single dataset of consumption and prices for the cities/zip codes served by both PG&E and SoCalGas.

`buildPanelPGE.R` and `buildPanelSoCal.R`

These scripts combine consumption, price, and weather data for PG&E's and SoCal's service areas (by zip code).

- INPUT: price, consumption, and weather datasets
- OUTPUT: zip-by-utility datasets (in DataR/Prices/PGE or */SoCal)
- NEXT SCRIPT: `buildPanelCommonZips.R`

`buildZipCodeNeighborGroups.R`

- Builds analysis datasets that "expand" the area of focus.
- Main analyses rely on zip codes served by *both* utilities.
- This script adds the zip codes that border the multi-utility zips, then adds those zips' neighbors, then their neighbors, and so on.
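
The ring-by-ring expansion can be sketched as a breadth-first search over a zip-code adjacency list. The adjacency list, zip labels, and function name below are hypothetical; how `buildZipCodeNeighborGroups.R` actually determines which zips border one another is not shown here.

```r
# Hypothetical adjacency list: each zip maps to the zips it borders.
neighbors <- list(
  A = c("B"),
  B = c("A", "C"),
  C = c("B", "D"),
  D = c("C", "E"),
  E = c("D")
)

# Starting from the multi-utility zips, add one ring of neighbors per step.
expand_rings <- function(core, neighbors, n_rings) {
  groups <- list(core)
  seen <- core
  for (i in seq_len(n_rings)) {
    # All zips bordering the previous ring...
    frontier <- unique(as.character(
      unlist(neighbors[groups[[i]]], use.names = FALSE)
    ))
    # ...minus any zip already assigned to an earlier ring.
    new_zips <- setdiff(frontier, seen)
    groups[[i + 1]] <- new_zips
    seen <- c(seen, new_zips)
  }
  names(groups) <- paste0("ring", 0:n_rings)
  groups
}

rings <- expand_rings(core = "A", neighbors = neighbors, n_rings = 3)
```

Here `ring0` holds the core (multi-utility) zips and each later element holds the zips first reached at that step, so cumulative unions of the rings give progressively wider analysis samples.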

`buildZipSummaries.R`

- Summarizes billing data for each zip code in the sample (baseline allowances, therms, bill amounts, number of bills, share CARE, etc.)

`buildZipUtilityLists.R`

- Simple script to find which zip codes are served by which utilities.
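
As a rough illustration of the idea (not the script itself), the zips served by both utilities can be found by counting distinct utilities per zip. The table layout, column names, and zip codes below are hypothetical.

```r
# Hypothetical zip-by-utility table; names and values are assumptions.
zip_utility <- data.frame(
  zip     = c("93001", "93001", "93002", "93003", "93003", "93003"),
  utility = c("PGE", "SoCal", "PGE", "SoCal", "PGE", "SoCal")
)

# Count the distinct utilities observed in each zip code.
n_utilities <- tapply(
  zip_utility$utility, zip_utility$zip,
  function(u) length(unique(u))
)

# Zips served by both utilities.
common_zips <- names(n_utilities)[n_utilities == 2]
```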

### Utility scripts

`convertToFst.R`

- Convert older 'rds' files to 'fst' for faster loading.

`intersectPGESoCal.R`

- Summarize billing data for zips served by both utilities.

### Visualization scripts

`makeTables.R`

- Custom function to build regression tables.

`mapDiscontinuity.R`

- Build a map of the utilities' service areas (and their overlap).

`plotDWL.R`

- Plot dead-weight loss figures for paper.

`plotExcessDensities.R`

- Plot densities of excess therms (consumption above baseline allowance) by season-income.

`plotFigures.R`

- Creates additional figures for the paper (and presentations):
  - Spot, citygate, and baseline prices
  - PG&E rate breakdown time series
  - Henry Hub spot price time series
  - CA residential natural gas
  - PG&E price regime example
  - PG&E and SoCal price regime series
  - Allowance series
  - Climate zone map
  - Common zip code map
  - Extended zip code map
  - Service areas map
  - Calendar example
  - PRISM daily data map (paper figure actually made in QGIS)

`plotGasVsElectricity.R`

- Creates a figure that compares the within-utility time series of gas and electricity prices.
- Builds figures in terms of levels and differences.
- Also runs regression for complementary analysis.

`tablesCommonZips.R`

- Build tables for main analyses.
- Includes first-stage F tests.

`table_felm.R`

- Another custom-table function.
- This function takes the output from an `felm` model (object) and formats the regression results using `knitr`'s `kable`.